-- card: 21114 from stack: in.01
-- bmap block id: 0
-- flags: 0000
-- background id: 7910
-- name:
-- part contents for background part 4
----- text -----
/* part 1 of 2 */

NOTES ON FREE TEXT INFORMATION RETRIEVAL
by Mark Zimmermann, March 1990

I've been thinking recently about some fundamental issues in big
free-text databases ... specifically, whether it will be feasible to
handle 100 MB/day or so of unstructured text (from wire services,
USENET, database downloads, scanned-in magazines, etc.) and do useful
research on it, the way that my FREE TEXT indexer/browser programs on
the Mac let you work with single files of tens of megabytes. I think
that the answer is "yes", 100 MB/day is manageable (for how long? -- a
month, at least?). But it will require some (straightforward, I hope)
extensions to my current indexer/browser systems.
This informal memorandum is a summary of my thoughts. It includes
comments on required features of the software, the technical design
decisions behind my current information retrieval (IR) codes, key
problems that must be solved to handle bigger free-text databases, and
some suggestions for how to implement solutions to those problems. My
goal is to provide a record of where my general free text IR thoughts
stand today, to help people who want to work with me on enhanced text
handling tools for researchers. I'd appreciate any comments or
suggestions or corrections that anybody wants to make!

REQUIREMENTS

I want real-time high-bandwidth responses to simple, common user queries
of the free-text database. By "real-time" I mean that single-term
retrieval requests should be answered in less than a second. (Complex
Boolean proximity searches can take a bit longer to set up and execute,
but even there, speed is important.) By "high-bandwidth" I mean that
the human has to get information back from the computer in a useful form
which can be evaluated quickly -- it's not good enough to return a list
of documents or files, each of which has to be paged through in order to
find the few relevant items. By "free-text" I mean that the input
stream can't be assumed to include any regular structure beyond words
separated by delimiters (spaces, punctuation, etc.).
I want tools for the earliest stages of research, when a person needs to
be able to browse and free-associate without even knowing the right
questions to ask. Users have to be able to work with the database
without intimate knowledge of the details of what's in it (since nobody
will have time to read much of a 100 MB/day flood of information). A big
database in a particular area has to be a "corporate memory" for groups
of scholars who are working on related topics. A new person has to be
able to come in and do research without extensive training, and an
experienced user has to find the system transparent, so that it becomes
an extension of her memory, not a barrier to getting at the data and
thinking with it. I see free-text IR tools as coexisting with other
types of tools, such as structured databases which are appropriate for
answering well-formulated queries at later phases of the research
effort.
I know of four fundamental end-user operations that a good real-time
high-bandwidth free-text IR system must support:
- viewing lists of words that occur in a database;
- selecting subsets of a big database to work within;
- browsing through candidate items efficiently; and
- reading and taking notes on retrieved information.
I'll discuss each of these four tasks below.

WORD LISTS

Viewing lists of words that occur in a database is exceedingly useful,
for many reasons. An alphabetized word list makes it easy to spot many
misspellings and variant suffixes for search terms. Word lists are
surprisingly useful in working with mixed-language databases,
particularly in spotting cognates and words that begin similarly to
related native-language terms. The ability to view a word list, along
with numerical information about occurrence rates in the whole database
and in subsets, is valuable in formulating better queries that maximize
good data retrieval. Perhaps most importantly, just being able to
scroll around in a list of terms is a superb aide-memoire in thinking of
words worth looking for in a database. Visible word lists are much
better than blind searching, even when wild cards are permitted. It's
precisely analogous to the "recognize vs. remember" dichotomy between
menu-driven and command-line user interfaces.
Word lists are easy to generate, provided that one has built a complete
inverted index to the database being browsed. For example:
   1  8805251042
   1  9511
6774  A
   1  AA01693
   1  AA03935
   2  AA13961
   1  AA29900
  34  ABLE
 291  ABOUT
  29  ABOVE
Each word is preceded by the number of times it occurs in the database
as a whole. Note the presence of numbers and document identifiers as
well as ordinary words.
A word list doesn't need to take up much storage space as files get big,
since the number of distinct words grows much more slowly than the file size.
There are various rules of thumb and theoretical formulae, such as
Zipf's Law, for estimating how big the word lists will get. In my
experience, unique dates, other numbers, garbles, typos, etc. tend to
keep the word lists growing more than expected, but not at an
unmanageable rate.
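(To give one concrete form: the empirical rule usually called Heaps'
law estimates the number of distinct words V in a text of n total
words as roughly V = K * n^b, where K and b are constants fitted to
the collection and b is well below 1. That sublinear growth is why
the word list stays small relative to the database.)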
My free text IR programs use an extremely simple file structure, which
makes generating word lists nearly trivial. In the "keys" index file, I
store as an array the words found in the database along with count
information. Thus, a single disk access can fill a small window with an
alphabetized word list along with each word's frequency of occurrence in
the database as a whole. (See the discussion under BROWSING below for
information on how counts are displayed when working in a subset of the
database.)
In my programs, I currently truncate words rather arbitrarily after 28
characters; I keep 4 bytes of cumulative count as a long integer; I
default to a standard character mapping from lower-case to upper-case; I
default to breaking words apart at punctuation marks; and I default to
alphabetizing according to the ASCII character set order. These choices
work well for English-language texts in my experience, and they're
easily changed if necessary by recompiling the indexing and browsing
programs. (In a future version of my programs, I intend to allow at
least the character mapping and alphabetization order to be defined
separately for each database; see the EXTENSIONS AND ENHANCEMENTS
section below.)
These design choices mean that each distinct word in a database occupies
32 bytes in the "keys" index file. It is possible to save a little
space by using a more complex data structure, but it would be at the
cost of speed and simplicity in the indexing and browsing programs. In
fact, as databases get larger the storage used by the "keys" file
becomes increasingly insignificant compared to the database and the
other part of the index (the "pointers" file, discussed under SUBSETS
and elsewhere below). The four-byte cumulative count element of the
data structure will limit the indexed databases to 2^32 total words.
That's about 4 billion words, or roughly 25 GB total file size, given a
typical word length of 5 letters plus 1 delimiter. Modifications to
this part of the index data structure will thus be necessary for very
large free-text databases (see EXTENSIONS AND ENHANCEMENTS).
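
To make that layout concrete, here is a minimal C sketch of one "keys"
entry and of filling a word-list window with a single read; the type
and function names are mine, for illustration, and are not taken from
the actual FREE TEXT source:

#include <stdio.h>

#define WORD_CHARS 28

/* One "keys" index entry, per the layout described above: a word
   field truncated to 28 characters plus a 4-byte cumulative count,
   32 bytes in all.  (Assumes a compiler where long is 4 bytes and
   this struct gets no padding.) */
typedef struct {
    char word[WORD_CHARS];  /* upper-cased, truncated, padded word */
    long count;             /* cumulative occurrence count */
} KeyEntry;

/* Fill one window of the alphabetized word list with a single
   disk access. */
long read_key_window(FILE *keys, long first, KeyEntry *buf, long n)
{
    if (fseek(keys, first * (long)sizeof(KeyEntry), SEEK_SET) != 0)
        return 0;
    return (long)fread(buf, sizeof(KeyEntry), (size_t)n, keys);
}

If, as the text above suggests, the stored count is a running total
over the whole database, then a word's own frequency is the difference
between adjacent entries, and the total also locates that word's run
of entries in the "pointers" file.
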
I index every word in the database, even common words such as "a" and
"the". Many free-text IR systems call such terms "stop words" and omit
them. The result is a saving of perhaps 20% - 40% in final index file
size, depending on how long the stop word list is. The cost, however,
is too great -- it becomes impossible to search for common terms in
proximity to each other or as components of names, part numbers, dates,
and so forth. A user can never find "To be, or not to be..." in
Shakespeare, since every word in that phrase is a stop word! It also
becomes impossible to look for "A. I. Smith" in a long name list, or to
find many short Chinese or Japanese words at all.
I am inalterably against the use of stop words; I also oppose "stemming"
or "truncation" in free-text indexing. Many conventional systems
attempt to remove suffixes so that they can index various forms of a
word under a single entry; for instance, "compute", "computer",
"computers", "computational", and "computing" might all be merged into a
single "comput" entry. This saves a small amount of space, but at the
expense of simplicity and control for the user. (At times, I may need
to retrieve "computers" but not "computer".) Many stemming systems also
get confused easily and garble words (particularly proper nouns) --
occasionally with disastrous results when a truncated term lands on the
stop word list and is left out of the index entirely! Foreign terms and
oriental names are particularly vulnerable to butchery by such systems.

SUBSETS

Researchers need to be able to select subsets of a big database to work
within. As information collections get very large, there will still be
unique or nearly-unique terms that people will want to retrieve quickly
-- the last occurrence of a particular name, a bizarre word or document
identifier that only occurs once, etc. But increasingly often, there
will be too many instances of an important word for convenient browsing
(but see the BROWSING section below for methods to delay that point).
It thus becomes necessary to filter the input stream in a fast and
flexible way, so that a human-usable volume of information is left to
study.
Proximity search is a common free-text IR approach to data filtering.
The usual proximity criteria, however, tend to be narrow, inflexible,
and (in my experience) extremely hazardous to retrieval when applied to
real-world databases. A conventional form of proximity search is to
enter a boolean query such as "FREE(w2)TEXT AND (INFORMATION OR
RETRIEV*)" which translates into "give me any document with FREE
occurring within two words before TEXT, which also contains INFORMATION
or any word beginning with RETRIEV". Of course, this example can't work
at all if the database isn't already divided up into "document" units.
Even if the query appears to work, the user will fail to see an
unknowable number of near-miss items which contain some but not all of
the specified words, or which use alternative spellings (including
typographical errors).
The user interface for conventional boolean proximity search is simply
terrible. It doesn't provide enough feedback to let the user avoid
subtle (or obvious) mistakes, it requires too much human memory and
creative energy in query formulation, and it gets in the way of seeing
the data. Often, an apparently-innocuous term in a parenthesized
boolean search statement actually distorts the entire meaning of the
query and eliminates all the useful information that was to be
retrieved. Expert searchers can overcome some of the barriers that a
command-line proximity search interface imposes, but it's never easy.
In my browser programs, I take a different approach. I let a user
define a neighborhood of interest in a free-text collection based on
loose proximity to words chosen from the full database word list (see
WORD LISTS above). For reasons of speed, simplicity, and convenience,
the subset proximity criterion is in bytes (characters of text), with a
default resolution of 32 bytes. Thus, a searcher could ask a fuzzy
equivalent of the above sample boolean query by starting out with an
empty subset and then adding the neighborhood of the word "FREE". This
is actually done by clicking on the word "FREE" as it is displayed in an
Index View window -- a perfect opportunity to see nearby alternative
search terms! The default neighborhood is "within-a-few-words", roughly
plus or minus 50 bytes. At this point, an excerpt from the database
subset's word list might be displayed as:
 5/34  ABLE
0/291  ABOUT
 0/29  ABOVE
  0/5  ABSTRACTS
 9/41  ACCESS
(Here, the term "ABLE" occurs 34 times in the database as a whole, of
which 5 instances are in the proximity neighborhood of "FREE".)
Next, after making a similar selection of the neighborhood around the
word "TEXT", the user can narrow down the subset of interest to the
intersection of the "FREE" and "TEXT" subsets. (This equivalent of the
boolean AND operation is again done with a mouse click.) Then,
broadening the neighborhood setting to 500 bytes or so, the user can
with a few more clicks define another subset of words very loosely near
INFORMATION or RETRIEVAL or RETRIEVALS or RETRIEVING, etc. Intersecting
that subset with the FREE TEXT subset gives the final subset of interest
for interactive browsing.
My experience, and that of many users, is that my kind of fuzzy
proximity search is far better than classical boolean search for finding
the information that one really wants. It's faster, safer, and far more
aesthetic to use as well. The mental model that the user has is simple
and powerful: neighborhoods of influence that radiate out from
interesting words, and the possible combinations of those neighborhoods.
The proximity search interface, after a few minutes of use, becomes
transparent and doesn't get in the way of thinking with the data.
My current indexer/browser programs assume that the free-text database
is a single file. That assumption is straightforward to lift, and I
plan to do so soon (see EXTENSIONS AND ENHANCEMENTS). But it is very
convenient, for many purposes, to treat the entire database as a linear
entity, and refer to byte offsets from the beginning of the database.
Databases which have additional structure (sections, chapters, pages,
etc.) can easily be handled within such a framework.
My current implementation of fuzzy proximity search uses vectors, linear
arrays, to represent the database subsets. Each bit of a subset vector
stands for a 32-byte (default value) section of free text. A bit is
cleared when its section is not included in a given neighborhood, and a
bit is set when its section is part of the subset. This approach of
representing 32 bytes with 1 bit results in a compression factor of 256,
so a gigabyte database can have a proximity neighborhood defined using a
4 MB vector. Subset operations are trivially simple with this data
structure. (See the EXTENSIONS AND ENHANCEMENTS section for a
discussion of what changes may be needed to handle multi-gigabyte
databases.)
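
A minimal C sketch of that data structure and its subset operations,
with the default 32-byte grain (names mine, for illustration):

#define GRAIN 32  /* bytes of database text represented by one bit */

/* Mark every GRAIN-byte section touched by the byte range [lo, hi],
   e.g. a +/- 50 byte neighborhood around one occurrence of a word.
   vec[] holds one bit per section, dbsize/GRAIN/8 bytes in all. */
void mark_range(unsigned char *vec, long lo, long hi)
{
    long g;
    if (lo < 0) lo = 0;
    for (g = lo / GRAIN; g <= hi / GRAIN; g++)
        vec[g >> 3] |= (unsigned char)(1 << (g & 7));
}

/* Intersecting two subsets (the equivalent of boolean AND) is a
   byte-wise AND down the vectors; union is the same loop with |=. */
void intersect(unsigned char *dst, const unsigned char *src, long nbytes)
{
    long i;
    for (i = 0; i < nbytes; i++)
        dst[i] &= src[i];
}
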
In order to set or clear bits in defining subsets, it is obviously
necessary to have pointers to the locations of every word in the
database. (It is necessary to have such pointers in any event to
retrieve information quickly, of course.) The second index file that my
programs use, the "pointers" file, is precisely that. In the current
implementation, it simply consists of 32-bit integers giving the offset
from the beginning of the database to each indexed word. Entries in the
pointer file map directly to proximity subset vectors, making it easy to
set and clear bits as needed to define regions of interest in the
database. (See the EXTENSIONS AND ENHANCEMENTS section for more on how
this model can grow to handle larger data collections.)
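
Continuing the same sketch, adding one word's neighborhood to a
subset is a single pass down its run of pointer entries:

/* ptrs[] is one word's run of entries from the "pointers" file:
   32-bit byte offsets from the start of the database.  radius is
   the current proximity setting (e.g. 50 bytes).  mark_range() is
   from the sketch above. */
void add_word_neighborhood(unsigned char *vec, const long *ptrs,
                           long n, long radius)
{
    long i;
    for (i = 0; i < n; i++)
        mark_range(vec, ptrs[i] - radius, ptrs[i] + radius);
}
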
Each instance of a word thus requires 4 bytes of space in the "pointers"
file. Since an average word is 5 letters long, plus one or more
terminal spaces or other delimiters, the average "pointers" file is less
than 80% of the size of the original database text file. It would
certainly be possible to save some disk space by compressing the
"pointers" file, but I doubt that more than half of the original file
size could be recovered. There would also be a price to pay, in terms
of slower retrieval, more severe database size limitations, and
increasingly complex (and less reliable) software. I thus chose to keep
the "pointers" file structure as simple as possible.
Many conventional free-text IR systems, I suspect, generate in-memory
lists in order to do proximity search operations. They tend to fail when
looking for common words in conjunction with each other, probably when
their data structures overflow. (I don't know for sure, as their
algorithms are proprietary and unpublished; I deduce their design by
observing their common failure cases.) I much prefer my subset vector
approach. It is simpler, it has guaranteed good run time behavior, and
it never breaks down when working with dense database subsets.
Conventional free-text IR systems also (I suspect) keep lists of
sentence and paragraph and document delimiters, in order to handle some
types of proximity search requests. I have never found it necessary (or
even desirable) to pay much attention to rigidly-defined boundaries such
as sentences or paragraphs in actually performing proximity searches.
The fuzzier proximity search that I have implemented is, in actual
experience, much safer and more likely to find the key data items that
are being sought. A user of my system will never lose a good piece of
information due to an unfortunate sentence division or paragraphing
choice by an author. As a secondary benefit, there is no need to
pre-process items flowing into the database in order to insert
paragraph, section, or document boundary markers before indexing.

BROWSING

The key to effective use of a free-text database is the ability to
browse rapidly through many items in order to find the few that are of
greatest interest. A good free-text IR system has to provide lists of
indexed words (see WORD LISTS above) and it has to fetch the full text
of documents rapidly upon demand (see READING below). But a great
free-text IR system must also provide facilities to bridge the gap
between word lists and full text retrieval. I do this in my programs by
applying an old concept, the "key word in context" display, in a new way
which I call a "context view".
A "key word in context" listing, sometimes called a KWIC, is a set of
lines extracted from a database which show the instances of a word's
occurrence, each with some context on either side. The "key word" is
lined up in a column down the middle of the page. Static, printed KWIC
indices have been around for many years, and are better than no index at
all, but they don't provide fast, transparent access to data. A user
still has to track down the actual reference in order to get back to the
full text of the item, and she still has no way to look at a subset of
the original database.
My context views are key-word-in-context displays which are generated on
the fly, at run time, in response to user requests. They look like a
common KWIC -- for example, a context view of six occurrences of
"Macintosh" in a small file might reveal:
rogram, without the Macintosh user interface features. Ho
her features of the Macintosh Browser: - easy access to
megabytes/hour on a Macintosh Plus, and over 60 megabytes/
[75066,2044]. The Macintosh Browser program is available
O-MAC> and the MAUG Macintosh Users' Forum on CompuServe.
send me a formatted Macintosh disk and a self-addressed st
Tabs, carriage returns, line feeds, etc. are all suppressed in a context
view, to allow the user to see the maximum amount of information on each
line.
A person can scroll around in a dynamic context view in a window on the
computer's screen, and quickly eliminate many of the uninteresting
usages of a word. When an item looks promising, a mouse click
immediately calls up the actual text from the original document for
reading and possible further use. (See READING below for further
comments.) When working in a proximity subset of the whole database, the
context view only shows the lines surrounding word instances in the
subset of interest.
My experience indicates that a human can comfortably browse through
hundreds of lines in a context view in order to locate a few good ones
to retrieve. It's at least an order of magnitude easier (and more
pleasant!) than wading through full text documents as they are
conventionally retrieved by IR systems.
The current implementation of context views in my FREE TEXT
indexer/browser programs is straightforward. For each line of context
to be displayed, a single disk access is needed to fetch the characters
from the appropriate point in the database file. (Since the entries in
the "pointers" file are already alphabetized and sorted together, one
access to that file provides many windows worth of context view
pointers; the cost of that access is thus negligible compared to the
fetching that has to be done from the text database file.) With some
operating-system caching of disk sectors as they are used, scrolling of
a context view window is reasonably smooth.
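
A sketch of the per-line fetch in C (the window size and key-word
position are arbitrary illustrative choices, not necessarily the
program's own):

#include <stdio.h>

/* Fetch one context-view line: read a fixed window of the database
   around the word's byte offset, then blank out tabs, carriage
   returns, line feeds, etc. so the line shows the maximum amount
   of text.  line[] must hold width + 1 characters. */
long fetch_context(FILE *db, long offset, char *line, long width)
{
    long i, n, start = offset - width / 3;  /* key word left of center */
    if (start < 0) start = 0;
    if (fseek(db, start, SEEK_SET) != 0) return 0;
    n = (long)fread(line, 1, (size_t)width, db);
    for (i = 0; i < n; i++)
        if ((unsigned char)line[i] < ' ') line[i] = ' ';
    line[n] = '\0';
    return n;
}
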
In order to guarantee a fast response to user requests when working in a
subset of the entire database, I currently apply a couple of simple
tricks. In generating the word list, for example, I give an exact count
of a word's occurrences in the subset only if that word occurs fewer than
100 times (default value). Thus, for example, an entry in a word list
might be:
~11% A
I take a sample of the occurrences of "A" in the database and report
back a percentage estimate of the fraction in the subset. (To be
precise, I default to counting the last 100 instances in the database.)
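
In the terms of the earlier sketches, the estimate amounts to a
fixed-size sample over the tail of the word's pointer run:

/* Test whether a byte offset lands in the subset (GRAIN and the
   vector layout are from the SUBSETS sketches). */
int in_subset(const unsigned char *vec, long offset)
{
    long g = offset / GRAIN;
    return (vec[g >> 3] >> (g & 7)) & 1;
}

#define SAMPLE 100L  /* default sample size */

/* For a common word, sample only the last SAMPLE entries of its
   pointer run and report the percentage that fall in the subset. */
long subset_percent(const unsigned char *vec, const long *ptrs, long n)
{
    long i, start, hits = 0;
    if (n <= 0) return 0;
    start = (n > SAMPLE) ? n - SAMPLE : 0;
    for (i = start; i < n; i++)
        if (in_subset(vec, ptrs[i])) hits++;
    return 100 * hits / (n - start);
}
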
In generating a context view, when more than 100 (default value)
instances of a word are skipped over because they fall outside the
subset of interest, I display an indication of that as a line of dots in
the context window; for example:
..........................................................
..........................................................
megabytes/hour on a Macintosh Plus, and over 60 megabytes/
..........................................................
If I didn't use this tactic, a user might try to retrieve in an empty
subset and have to wait for the entire database to be scanned before a
"no valid instances" message could come back. I find the quicker
response of my approach superior, and I also find it useful to see an
indication (by lines of dots) when I'm working in a "desert", an
almost-empty subset.
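
Putting those pieces together, a sketch of the context-view loop with
the skip indicator (reusing the illustrative helpers above):

#include <stdio.h>

#define SKIP_LIMIT 100  /* default skip indication threshold */

/* Walk one word's pointer run: show a context line for each
   instance inside the subset, and print a line of dots for every
   SKIP_LIMIT instances skipped, so even a nearly-empty subset
   gives quick feedback.  Uses in_subset() and fetch_context()
   from the sketches above. */
void show_context(FILE *db, const unsigned char *vec,
                  const long *ptrs, long n)
{
    long i, skipped = 0;
    char line[81];
    for (i = 0; i < n; i++) {
        if (!in_subset(vec, ptrs[i])) {
            if (++skipped == SKIP_LIMIT) {
                puts("..........................................................");
                skipped = 0;
            }
        } else {
            skipped = 0;
            if (fetch_context(db, ptrs[i], line, 58L) > 0)
                puts(line);
        }
    }
}
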

READING

The bottom line for a free-text IR system is to get a desired document
(or excerpt) in front of the user, so that it can be read and worked
with. It goes without saying that a good IR system must make it easy to
scroll around in a database, copy out selections, and paste them into
working papers, other programs, or anyplace else that the user desires.
Handling this interprocess communication is a job for the operating
system, but the IR program must be at least open enough to make it very
easy to export data to other programs. If the original document in the
database has some structure (and thus isn't purely "free text") it may
be valuable to reflect that structure in the retrieved data. Chapter or
section titles can be displayed, for instance, or if there are graphics
or font style changes they can be shown properly. The ultimate goal is
to have the original document (or a high-resolution image of it) appear
in front of the user, open to the right page, with a highlight to draw
attention to the key part. That's unfortunately not easy to do today!
My current systems go only partway toward meeting this goal. I display
text and highlight the index term that was the original basis for the
retrieval. But I only extract a small chunk (a few thousand characters,
by default) of the text document and let the user scroll around in that.
It's hard to do more and still maintain the speed and simplicity of the
software. I don't handle non-textual elements of the database at all,
and I only display a single font.
Beyond mere retrieval, an ideal free-text IR system would make it easy
to annotate retrieved information, add valuable comments and leave a
trail behind so that other explorers (or the same explorer, at a later
date) could profit from previous experience. Hypertextual links between
documents should be easy to create (as one type of annotation, perhaps)
and should be easy to see and follow. They would then provide another
set of paths through the data space. All of these topics are under
active research today, and perhaps within a few years robust,
transportable systems to do such jobs will be available. In the
meantime, however, my top priority is to do a better job at the
fundamental task -- fast, easy, open, good real-time high-bandwidth
free-text information retrieval on big databases.
/* end of part 1 of 2 - continued on next card */